transport/common/hci_h4: propagate OOM instead of asserting#2233
transport/common/hci_h4: propagate OOM instead of asserting#2233gmarull wants to merge 1 commit into
Conversation
When hci_h4_sm_w4_header() or hci_h4_sm_w4_payload() returns -1 because the underlying allocator returned NULL (transport ACL pool exhausted), hci_h4_sm_rx() currently asserts via two `assert(rc >= 0)` checks. This turns a recoverable pool-pressure event into a fatal crash. The condition is reachable in practice on host-side transports such as SiFli SF32LB52: the H4 RX task allocates ACL mbufs from mpool_acl and queues them on ble_hs_rx_q for the host event loop to drain. If the host is parked (e.g. blocked while serving GATT notifications into a downstream buffer that drains slowly) the queue grows until the pool is empty; the next allocation returns NULL and the assert fires. Remove both `assert(rc >= 0)` so -1 propagates through hci_h4_sm_rx() to the transport. Callers can then back off (e.g. block until a buffer is returned to the pool via the put callback) and retry with the same input. The existing `assert(rc < 0)` at the bottom of hci_h4_sm_rx() still enforces the legitimate "consumed nothing => rc indicates error" invariant. Note that when the input source is destructive, callers must retry with the same buffer rather than drop bytes on -1, otherwise the H4 framing desyncs. The state machine already preserves partial state across calls so retry is safe. Signed-off-by: Gerard Marull-Paretas <gerard@teslabs.com>
|
H4 has no way of recovery if data is dropped and this was added on purpose [1] @andrzej-kaczmarek thoughts? [1] 2a0379d |
We're still analyzing this in more detail; we just raised this to get some initial feedback (see the other side here coredevices/PebbleOS#1277). Not all users are experiencing this problem, but some exhaust the ACL pool easily (despite having increased it a few times). |
|
With H4 you either have infinite memory or some sort of flow control (ie for RS232 Bluetooth HCI UART transport layer requires HWFC). No sure how this is on SiFli but if IPC/mailbox is used with H4 ontop than those should have flow control in transport Also, Controller to Host data flow control should probably be used (BLE_HS_FLOW_CTRL) |
When hci_h4_sm_w4_header() or hci_h4_sm_w4_payload() returns -1 because the underlying allocator returned NULL (transport ACL pool exhausted), hci_h4_sm_rx() currently asserts via two
assert(rc >= 0)checks. This turns a recoverable pool-pressure event into a fatal crash.The condition is reachable in practice on host-side transports such as SiFli SF32LB52: the H4 RX task allocates ACL mbufs from mpool_acl and queues them on ble_hs_rx_q for the host event loop to drain. If the host is parked (e.g. blocked while serving GATT notifications into a downstream buffer that drains slowly) the queue grows until the pool is empty; the next allocation returns NULL and the assert fires.
Remove both
assert(rc >= 0)so -1 propagates through hci_h4_sm_rx() to the transport. Callers can then back off (e.g. block until a buffer is returned to the pool via the put callback) and retry with the same input. The existingassert(rc < 0)at the bottom of hci_h4_sm_rx() still enforces the legitimate "consumed nothing => rc indicates error" invariant.Note that when the input source is destructive, callers must retry with the same buffer rather than drop bytes on -1, otherwise the H4 framing desyncs. The state machine already preserves partial state across calls so retry is safe.